15 research outputs found

    A New Approach for Sparse Matrix Classification Based on Deep Learning Techniques

    In this paper, a new methodology based on deep learning techniques is introduced to select the best storage format for sparse matrices. We focus on selecting the proper format for sparse matrix-vector multiplication (SpMV), one of the most important computational kernels in many scientific and engineering applications. Our approach considers the sparsity pattern of the matrices as an image, using the RGB channels to encode several of the matrix properties. As a consequence, we generate image datasets that include enough information to successfully train a Convolutional Neural Network (CNN). Considering GPUs as target platforms, the trained CNN selects the best storage format 90.1% of the time, obtaining 99.4% of the highest SpMV performance among the tested formats. This work has been supported by MINECO (TIN2014-54565-JIN and MTM2016-76969-P), Xunta de Galicia (ED431G/08) and the European Regional Development Fund.
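
    As a rough illustration of encoding a sparsity pattern as an RGB image, the Python sketch below maps the non-zeros onto a fixed-size image and fills the three channels with simple structural properties (non-zero density, value magnitude, distance to the diagonal). These particular encodings are assumptions for illustration only, not the exact matrix properties used in the paper.

```python
# Illustrative sketch: render the sparsity pattern of a sparse matrix as a
# fixed-size RGB image so it can be fed to a CNN-based format classifier.
# The per-channel encodings are illustrative assumptions, not the paper's.
import numpy as np
from scipy.sparse import random as sparse_random

def matrix_to_rgb(m, size=128):
    m = m.tocoo()
    img = np.zeros((size, size, 3), dtype=np.float32)
    # Map each non-zero (i, j) to a pixel of the size x size image.
    rows = (m.row * size // m.shape[0]).astype(int)
    cols = (m.col * size // m.shape[1]).astype(int)
    # R: non-zero density per pixel block.
    np.add.at(img[:, :, 0], (rows, cols), 1.0)
    # G: accumulated magnitude of the non-zero values.
    np.add.at(img[:, :, 1], (rows, cols), np.abs(m.data))
    # B: average distance of the non-zeros to the main diagonal.
    diag_dist = np.abs(m.row.astype(np.float32) - m.col) / max(m.shape)
    np.add.at(img[:, :, 2], (rows, cols), diag_dist)
    # Normalise every channel to [0, 1].
    for c in range(3):
        maxval = img[:, :, c].max()
        if maxval > 0:
            img[:, :, c] /= maxval
    return img

if __name__ == "__main__":
    mat = sparse_random(10000, 10000, density=1e-3, format="coo")
    print(matrix_to_rgb(mat).shape)  # (128, 128, 3)
```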

    A unified framework to improve the interoperability between HPC and Big Data languages and programming models

    One of the most important issues on the path to the convergence of HPC and Big Data is caused by the differences in their software stacks. Despite some research efforts, the interoperability between their programming models and languages is still limited. To deal with this problem, we introduce a new computing framework called IgnisHPC, whose main objective is to unify the execution of Big Data and HPC workloads in the same framework. IgnisHPC has native support for multi-language applications using JVM and non-JVM-based languages. Since MPI is used as its backbone technology, IgnisHPC takes advantage of many communication models and network architectures. Moreover, MPI applications can be executed directly and efficiently in the framework. As a result, users can combine HPC tasks (using MPI) and Big Data tasks (using MapReduce operations) in the same multi-language code. The experimental evaluation demonstrates the benefits of our proposal in terms of performance and productivity with respect to other frameworks. IgnisHPC is publicly available to the Big Data and HPC research community. This work has been supported by MICINN, Spain (RTI2018-093336-B-C21, PLEC2021-007662), Xunta de Galicia, Spain (ED431G/08, ED431G-2019/04 and ED431C-2018/19) and the European Regional Development Fund (ERDF).
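
    The following toy sketch (not the IgnisHPC API) illustrates the general idea of mixing MapReduce-style operations with MPI communication in a single program, assuming mpi4py is available.

```python
# Toy illustration (not the IgnisHPC API): a MapReduce-style word count whose
# reduce phase is carried out with an MPI collective, assuming mpi4py.
# Run with, e.g.:  mpirun -n 4 python wordcount_mpi.py
from collections import Counter
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# "Map" phase: each rank processes its own shard of the input.
local_lines = [f"line {i} from rank {rank}" for i in range(1000)]
local_counts = Counter(word for line in local_lines for word in line.split())

# "Reduce" phase via MPI: gather the partial counts at rank 0 and merge them.
all_counts = comm.gather(local_counts, root=0)
if rank == 0:
    total = sum(all_counts, Counter())
    print(total.most_common(5))
```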

    Real-time focused extraction of social media users

    In this paper, we explore a real-time automation challenge: the problem of focused extraction of Social Media users. This challenge can be seen as a special form of focused crawling where the main target is to detect users with certain patterns. Given a specific user profile, the task consists of rapidly ingesting Social Media data and detecting target users early. This is a real-time intelligent automation task with numerous applications in domains such as safety, health or marketing. The volume and dynamics of Social Media contents demand efficient real-time solutions able to predict which users are worth exploring. To meet this aim, we propose and evaluate several methods that effectively allow us to harvest relevant users. Even with little contextual information (e.g., a single user submission), our methods quickly focus on the most promising users. We also developed a distributed microservice architecture that supports real-time parallel extraction of Social Media users. This modular architecture scales up in clusters of computers and can easily be adapted for user extraction in multiple domains and Social Media sources. Our experiments suggest that some of the proposed prioritisation methods, which work with minimal user context, are effective at rapidly focusing on the most relevant users. These methods perform satisfactorily with huge volumes of users and interactions and lead to harvest ratios 2 to 9 times higher than those achieved by random prioritisation. This work was supported in part by the Ministerio de Ciencia e Innovación (MICINN) under Grant RTI2018-093336-B-C21 and Grant PLEC2021-007662; in part by Xunta de Galicia under Grant ED431G/08, Grant ED431G-2019/04, Grant ED431C 2018/19, and Grant ED431F 2020/08; and in part by the European Regional Development Fund (ERDF).
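
    A minimal sketch of one possible prioritisation loop is shown below: candidate users are kept in a priority queue keyed by a predicted relevance score, so the most promising ones are explored first. The score_user, fetch_context and is_target functions are hypothetical placeholders, not the actual methods evaluated in the paper.

```python
# Minimal sketch of a focused user-extraction loop: a priority queue keyed by
# a predicted relevance score decides which user to explore next.
import heapq

def score_user(user_context):
    # Hypothetical placeholder: e.g. a classifier probability computed from
    # the user's submissions seen so far.
    return len(user_context.get("submissions", [])) / 10.0

def focused_extraction(seed_users, fetch_context, is_target, budget=1000):
    frontier = [(-score_user(u), u["id"], u) for u in seed_users]
    heapq.heapify(frontier)
    harvested = []
    while frontier and len(harvested) < budget:
        _, _, user = heapq.heappop(frontier)
        context = fetch_context(user)          # ingest more Social Media data
        if is_target(context):                 # early detection of target users
            harvested.append(user["id"])
        for neighbour in context.get("related_users", []):
            heapq.heappush(frontier, (-score_user(neighbour),
                                      neighbour["id"], neighbour))
    return harvested
```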

    A Big Data Platform for Real Time Analysis of Signs of Depression in Social Media

    In this paper we propose a scalable platform for real-time processing of Social Media data. The platform ingests huge amounts of contents, such as Social Media posts or comments, and can support Public Health surveillance tasks. The processing and analytical needs of multiple screening tasks can easily be handled by incorporating user-defined execution graphs. The design is modular and supports different processing elements, such as crawlers to extract relevant contents or classifiers to categorise Social Media. We describe here an implementation of a use case built on the platform that monitors Social Media users and detects early signs of depression. This work was funded by the FEDER/Ministerio de Ciencia, Innovación y Universidades - Agencia Estatal de Investigación project (RTI2018-093336-B-C21). Our research also receives financial support from the Consellería de Educación, Universidade e Formación Profesional (accreditation 2019–2022 ED431G-2019/04, ED431C 2018/29, ED431C 2018/19) and the European Regional Development Fund (ERDF), which acknowledges the CiTIUS-Research Center in Intelligent Technologies of the University of Santiago de Compostela as a Research Center of the Galician University System.
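
    The following toy sketch illustrates the notion of a user-defined execution graph built from pluggable processing elements (crawler, classifier, alerting); it is not the platform's actual API, and the keyword-based classifier is a placeholder.

```python
# Toy sketch of a user-defined execution graph: processing elements are
# chained as generator stages (crawler -> classifier -> alerting). This only
# illustrates the idea of pluggable elements, not the platform's real API.
def crawler(sources):
    for src in sources:
        yield {"user": src, "text": f"post from {src}"}

def classifier(posts):
    for post in posts:
        # Placeholder screening model: flag posts containing a keyword.
        post["depression_sign"] = "sad" in post["text"].lower()
        yield post

def alerting(posts):
    for post in posts:
        if post["depression_sign"]:
            yield post["user"]

def run_graph(sources, stages):
    stream = sources
    for stage in stages:
        stream = stage(stream)
    return list(stream)

flagged = run_graph(["user1", "user2"], [crawler, classifier, alerting])
```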

    Ignis: An efficient and scalable multi-language Big Data framework

    Most of the relevant Big Data processing frameworks (e.g., Apache Hadoop, Apache Spark) only support JVM (Java Virtual Machine) languages by default. In order to support non-JVM languages, subprocesses are created and connected to the framework using system pipes. This technique makes it impossible to manage the data at the thread level and introduces an important performance loss. To address this problem we introduce Ignis, a new Big Data framework that provides an elegant way to create multi-language executors managed through an RPC system. As a consequence, the new system is able to natively execute applications implemented in non-JVM languages. In addition, Ignis allows users to combine in the same application the benefits of implementing each computational task in the best-suited programming language, without additional overhead. The system runs completely inside Docker containers, isolating the execution environment from the physical machine. A comparison with Apache Spark shows the advantages of our proposal in terms of performance and scalability. This work has been supported by MICINN, Spain (RTI2018-093336-B-C21), Xunta de Galicia, Spain (ED431G/08 and ED431C-2018/19) and the European Regional Development Fund (ERDF).
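
    As a toy illustration of an RPC-managed executor (not Ignis's actual protocol or API), the sketch below exposes a worker process over XML-RPC so that a driver can push a data partition and invoke operations on it without going through system pipes.

```python
# Toy illustration of an RPC-managed executor (not Ignis's actual protocol):
# a worker exposes data operations over XML-RPC so the driver can push
# partitions and trigger computations without system pipes.
from xmlrpc.server import SimpleXMLRPCServer

class Executor:
    def __init__(self):
        self.partition = []

    def load(self, records):
        self.partition = list(records)
        return len(self.partition)

    def map_square(self):
        # A fixed operation for the sketch; a real system would dispatch
        # arbitrary serialized functions here.
        self.partition = [x * x for x in self.partition]
        return True

    def collect(self):
        return self.partition

if __name__ == "__main__":
    server = SimpleXMLRPCServer(("localhost", 9090), allow_none=True)
    server.register_instance(Executor())
    server.serve_forever()
```

    A driver process could then connect with xmlrpc.client.ServerProxy("http://localhost:9090") and call load(), map_square() and collect() in turn.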

    Collaboration between lecturers from a German and a Spanish university to develop practical seminars on information credibility

    In this teaching experience, we present an international collaboration between researchers from the University of Regensburg (Bavaria, Germany) and the University of Santiago de Compostela (Galicia, Spain), in which students participate in two Information Science courses. We propose an approach in which students build their own understanding of technologies by tackling a problem posed in the context of a research project. The challenge focuses on how to help end users of information technologies assess the credibility of information. To this end, the students worked with online content related to the COVID-19 pandemic. This work was funded by FEDER/Ministerio de Ciencia, Innovación y Universidades - Agencia Estatal de Investigación (RTI2018-093336-B-C21), and also received financial support from the Consellería de Educación, Universidade e Formación Profesional (accreditation 2019-2022 ED431G-2019/04, ED431C 2018/29, ED431C 2018/19) and from ERDF funds, which recognise the CiTIUS of the USC as a Research Centre of the Galician University System.

    An unsupervised perplexity-based method for boilerplate removal

    The availability of large web-based corpora has led to significant advances in a wide range of technologies, including massive retrieval systems and deep neural networks. However, leveraging this data is challenging, since web content is plagued by so-called boilerplate: ads, incomplete or noisy text and remnants of the navigation structure, such as menus or navigation bars. In this work, we present a novel and efficient approach to extract useful and well-formed content from web-scraped data. Our approach takes advantage of Language Models and their implicit knowledge about correctly formed text, and we demonstrate that perplexity is a valuable signal that contributes to both effectiveness and efficiency. In fact, removing the noisy parts leads to lighter AI or search solutions that remain effective and entail important reductions in the resources spent. We exemplify the usefulness of our method with two downstream tasks, search and classification, and a cleaning task. We also provide a Python package with pre-trained models and a web demo demonstrating the capabilities of our approach.
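
    The sketch below shows the general principle with a deliberately simple character-level bigram language model: lines whose perplexity under a model of well-formed text exceeds a threshold are discarded as boilerplate. The model, threshold and sample data are illustrative assumptions; the paper's Language Models are more capable.

```python
# Self-contained sketch of perplexity-based boilerplate filtering, using a
# character-level bigram model with add-one smoothing trained on well-formed
# text. Lines whose perplexity exceeds a threshold are treated as boilerplate.
import math
from collections import Counter

def train_bigram_lm(corpus):
    bigrams, unigrams = Counter(), Counter()
    for text in corpus:
        text = "^" + text
        unigrams.update(text)
        bigrams.update(zip(text, text[1:]))
    vocab = len(unigrams) + 1
    return bigrams, unigrams, vocab

def perplexity(model, text):
    bigrams, unigrams, vocab = model
    text = "^" + text
    log_prob, n = 0.0, 0
    for a, b in zip(text, text[1:]):
        p = (bigrams[(a, b)] + 1) / (unigrams[a] + vocab)  # add-one smoothing
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / max(n, 1))

def remove_boilerplate(lines, model, threshold=25.0):
    return [line for line in lines if perplexity(model, line) < threshold]

model = train_bigram_lm(["this is a well formed sentence about a topic",
                         "another clean sentence written in plain language"])
page = ["the committee discussed the new proposal in detail",
        "Home | About | Contact | Login >>"]
print(remove_boilerplate(page, model))
```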

    Using an extended Roofline Model to understand data and thread affinities on NUMA systems

    Today's microprocessors include multicores that feature a diverse set of compute cores and onboard memory subsystems connected by complex communication networks and protocols. The analysis of the factors that affect performance in such complex systems is far from an easy task. In any case, it is clear that increasing data locality and affinity is one of the main challenges in reducing the latency of data accesses. As the number of cores increases, the influence of this issue on the performance of parallel codes becomes more and more important. Therefore, models to characterise the performance of such systems are in broad demand. This paper shows the use of an extension of the well-known Roofline Model adapted to the main features of the memory hierarchy present in most current multicore systems. The Roofline Model was also extended to show the dynamic evolution of the execution of a given code. To reduce the overhead of gathering the information needed to obtain this dynamic Roofline Model, the hardware counters present in most current microprocessors are used. To illustrate its use, two simple parallel vector operations, SAXPY and SDOT, were considered. Different access strides and initial locations of the vectors in the memory modules were used to show the influence of different scenarios in terms of locality and affinity. The effect of thread migration was also considered. We conclude that the proposed Roofline Model is a useful tool to understand and characterise the behaviour of the execution of parallel codes on multicore systems. This work has been partially supported by the Ministry of Education and Science of Spain, FEDER funds under contract TIN 2010-17541, and Xunta de Galicia, EM2013/041. It has been developed in the framework of the European network HiPEAC and the Spanish network CAPAP-H.
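
    For reference, the classic Roofline bound states that attainable performance is min(peak compute, memory bandwidth x operational intensity). The short sketch below applies it to SAXPY and SDOT using illustrative machine parameters (assumed values, not measurements from the paper).

```python
# Classic Roofline bound: attainable GFLOP/s = min(peak_gflops, bw * OI),
# where OI is the operational intensity in flops per byte of memory traffic.
# The peak numbers below are illustrative machine parameters, not results
# from the paper.
def roofline(oi, peak_gflops=200.0, peak_bw_gbs=50.0):
    return min(peak_gflops, peak_bw_gbs * oi)

# Single-precision SAXPY: y = a*x + y -> 2 flops per element,
# 12 bytes moved (load x, load y, store y).
oi_saxpy = 2.0 / 12.0
# Single-precision SDOT: s += x*y -> 2 flops per element, 8 bytes moved.
oi_sdot = 2.0 / 8.0

for name, oi in [("SAXPY", oi_saxpy), ("SDOT", oi_sdot)]:
    print(f"{name}: OI = {oi:.3f} flop/byte, "
          f"bound = {roofline(oi):.1f} GFLOP/s")
```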

    SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data

    Next-generation sequencing (NGS) technologies have led to a huge amount of genomic data that need to be analyzed and interpreted. This has a major impact on the DNA sequence alignment process, which nowadays requires the mapping of billions of small DNA sequences onto a reference genome. Sequence alignment thus remains the most time-consuming stage in the sequence analysis workflow. To deal with this issue, state-of-the-art aligners take advantage of parallelization strategies. However, the existing solutions show limited scalability and have a complex implementation. In this work we introduce SparkBWA, a new tool that exploits the capabilities of a Big Data technology such as Spark to boost the performance of one of the most widely adopted aligners, the Burrows-Wheeler Aligner (BWA). The design of SparkBWA uses two independent software layers in such a way that no modifications to the original BWA source code are required, which ensures its compatibility with any BWA version (future or legacy). SparkBWA is evaluated in different scenarios, showing notable results in terms of performance and scalability. A comparison with other parallel BWA-based aligners validates the benefits of our approach. Finally, an intuitive and flexible API is provided to NGS professionals in order to facilitate the acceptance and adoption of the new tool. The source code of the software described in this paper is publicly available at https://github.com/citiususc/SparkBWA, under a GPL3 license. This work was supported by Ministerio de Economía y Competitividad (Spain) (http://www.mineco.gob.es) grants TIN2013-41129-P and TIN2014-54565-JIN. There was no additional external funding received for this study.
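
    The sketch below is a generic PySpark illustration of distributing alignment work by piping each partition of reads through an external aligner command; it is a simplified stand-in, not SparkBWA's actual two-layer integration with the BWA code. The input path, aligner command and its flags are hypothetical.

```python
# Generic PySpark sketch: distribute read alignment by piping each partition
# through an external aligner command. This is a simplified illustration of
# the idea, not SparkBWA's actual design (two software layers, no changes to
# the BWA source code).
from pyspark import SparkContext

sc = SparkContext(appName="toy-distributed-alignment")

# Assumption: reads were pre-converted so that each line is one full record
# (plain FASTQ cannot be split line by line safely).
reads = sc.textFile("hdfs:///data/reads_one_per_line.txt", minPartitions=64)

# Hypothetical aligner command reading records on stdin and writing SAM lines
# on stdout; replace with the real tool available on the worker nodes.
aligned = reads.pipe("aligner --reference /data/ref.fa --stdin")

aligned.saveAsTextFile("hdfs:///results/alignments")
```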

    PoS tagging and Named Entity Recognition in a Big Data environment

    This article describes a suite of linguistic modules for the Spanish language based on a pipeline architecture, which includes tasks for PoS tagging and Named Entity Recognition and Classification (NERC). We have applied run-time parallelization techniques in a Big Data environment in order to make the suite of modules more efficient and scalable, and thereby reduce computation time significantly, allowing us to address problems at Web scale. The linguistic modules have been developed using basic NLP techniques so that they can be easily integrated in distributed computing environments. The qualitative performance of the modules is close to the state of the art. This work has been funded by the projects HPCPLN (Ref: EM13/041, Programa Emergentes, Xunta de Galicia), Celtic (Ref: 2012-CE138) and Plastic (Ref: 2013-CE298) (Programa Feder-Innterconecta).
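
    The sketch below illustrates the parallelisation pattern with PySpark: the corpus is processed per partition so each worker loads the linguistic models only once. SimpleTagger and the HDFS paths are hypothetical placeholders for the suite's real Spanish PoS/NERC modules.

```python
# Minimal sketch of the parallelisation pattern: run the linguistic pipeline
# per partition so that each worker loads its models a single time.
from pyspark import SparkContext

class SimpleTagger:
    """Hypothetical stand-in for the real Spanish PoS/NERC pipeline."""
    def tag(self, sentence):
        return [(token, "NOUN") for token in sentence.split()]

def tag_partition(sentences):
    tagger = SimpleTagger()            # load models once per partition
    for sentence in sentences:
        yield tagger.tag(sentence)     # PoS tags (and NERC in the real suite)

sc = SparkContext(appName="parallel-pos-nerc")
corpus = sc.textFile("hdfs:///corpus/es_sentences.txt", minPartitions=128)
tagged = corpus.mapPartitions(tag_partition)
tagged.saveAsTextFile("hdfs:///corpus/es_tagged")
```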